METHOD AND APPARATUS FOR PIPELINED JOINT EQUALIZATION AND
DECODING FOR GIGABIT COMMUNICATIONS 

Cross Reference to Related Applications 

This application claims the benefit of United States Provisional Application
Number 60/245,519, filed November 3, 2000.



Field of the Invention 

The present invention relates generally to channel equalization and decoding
techniques, and more particularly, to sequence estimation techniques with shorter critical paths.

Background of the Invention 


The transmission rates for local area networks (LANs) that use unshielded twisted
pair (UTP) copper cabling have progressively increased from 10 Megabits-per-second (Mbps) to
1 Gigabit-per-second (Gbps). The Gigabit Ethernet 1000BASE-T standard, for example, operates
at a clock rate of 125 MHz and uses UTP cabling of Category 5 with four pairs to transmit 1
Gbps. Trellis-coded modulation (TCM) is employed by the transmitter, in a known manner, to
achieve coding gain. The signals arriving at the receiver are typically corrupted by intersymbol
interference (ISI), crosstalk, echo, and noise. A major challenge for 1000BASE-T receivers is to
jointly equalize the channel and decode the corrupted trellis-coded signals at the demanded clock
rate of 125 MHz, as the algorithms for joint equalization and decoding incorporate non-linear
feedback loops that cannot be pipelined.

Data detection is often performed using maximum likelihood sequence
estimation, to produce the output symbols or bits. A maximum likelihood sequence estimator
considers all possible sequences and determines which sequence was actually transmitted, in a
known manner. The maximum likelihood sequence estimator is the optimum decoder and
applies the well-known Viterbi algorithm to perform joint equalization and decoding. For a more
detailed discussion of a Viterbi implementation of a maximum likelihood sequence estimator
(MLSE), see Gerhard Fettweis and Heinrich Meyr, "High-Speed Parallel Viterbi Decoding
Algorithm and VLSI-Architecture," IEEE Communications Magazine (May 1991), incorporated
by reference herein.

In order to reduce the hardware complexity for the maximum likelihood sequence
estimator that applies the Viterbi algorithm, a number of sub-optimal approaches which are
referred to as reduced-state sequence estimation (RSSE) have been proposed. For a discussion of
reduced state sequence estimation techniques, as well as the special cases of decision-feedback
sequence estimation (DFSE) and parallel decision-feedback decoding (PDFD) techniques, see,
for example, P. R. Chevillat and E. Eleftheriou, "Decoding of Trellis-Encoded Signals in the
Presence of Intersymbol Interference and Noise," IEEE Trans. Commun., vol. 37, pp. 669-676
(July 1989), M. V. Eyuboglu and S. U. H. Qureshi, "Reduced-State Sequence Estimation For Coded
Modulation On Intersymbol Interference Channels," IEEE JSAC, vol. 7, pp. 989-995 (Aug. 1989), or
A. Duel-Hallen and C. Heegard, "Delayed Decision-Feedback Sequence Estimation," IEEE
Trans. Commun., vol. 37, pp. 428-436 (May 1989), each incorporated by reference herein.
Generally, reduced state sequence estimation techniques reduce the complexity of
the maximum likelihood sequence estimators by merging several states. The RSSE technique
incorporates non-linear feedback loops that cannot be pipelined. The critical path associated
with these feedback loops is the limiting factor for high-speed implementations.
United States Patent Application Serial Number 09/326,785, filed June 4, 1999
and entitled "Method and Apparatus for Reducing the Computational Complexity and Relaxing
the Critical Path of Reduced State Sequence Estimation (RSSE) Techniques," incorporated by
reference herein, discloses a technique that reduces the hardware complexity of RSSE for a given
number of states and also relaxes the critical path problem. United States Patent Application
Serial Number 09/471,920, filed December 23, 1999, entitled "Method and Apparatus for
Shortening the Critical Path of Reduced Complexity Sequence Estimation Techniques,"
incorporated by reference herein, discloses a technique that improves the throughput of RSSE by
pre-computing the possible values for the branch metrics in a look-ahead fashion to permit
pipelining and the shortening of the critical path. The complexity of the pre-computation
technique, however, increases exponentially with the length of the channel impulse response. In
addition, the delay through the selection circuitry that selects the actual branch metrics among all
precomputed ones increases with L, eventually neutralizing the speed gain achieved by the
precomputation.

A need therefore exists for a technique that increases the throughput of RSSE
algorithms using precomputations with only a linear increase in hardware complexity with
respect to the look-ahead computation depth.



Summary of the Invention

Generally, a method and apparatus are disclosed for the implementation of
reduced state sequence estimation with an increased throughput using precomputations
(look-ahead), while only introducing a linear increase in hardware complexity with respect to the
look-ahead depth. RSSE techniques typically decode a received signal and compensate for
intersymbol interference using a decision feedback unit (DFU), a branch metrics unit (BMU), an
add-compare-select unit (ACSU) and a survivor memory unit (SMU). The present invention
limits the increase in hardware complexity by taking advantage of past decisions. The past
decision may be a past ACS decision of the ACSU or a past survivor symbol in the SMU or a
combination thereof. The critical path of a conventional RSSE implementation is broken up into
at least two smaller critical paths using pipeline registers.

A reduced state sequence estimator is disclosed that employs a one-step
look-ahead technique to process a signal received from a dispersive channel having a channel memory.
Initially, a speculative intersymbol interference estimate is precomputed based on a combination
of (i) a speculative partial intersymbol interference estimate for a first postcursor tap of the
channel impulse response, based on each possible value for a data symbol, and (ii) a combination
of partial intersymbol interference estimates for each subsequent postcursor tap of the channel
impulse response, where at least one of the partial intersymbol interference estimates for the
subsequent postcursor taps is based on a past survivor symbol from the corresponding state. In
addition, a branch metric is precomputed based on the precomputed intersymbol interference
estimate. One of the precomputed branch metrics is selected based on a past decision from the
corresponding state. The past decision may be a past ACS decision of the ACSU or a past
survivor symbol in the SMU or a combination of both. The selected branch metric is used to
compute new path metrics for path extensions from a corresponding state. The computed new
path metrics are used to determine the best survivor path and path metric for a corresponding
state.

A reduced state sequence estimator is also disclosed that employs a multiple-step
look-ahead technique to process a signal received from a dispersive channel having a channel
memory. Initially, a speculative partial intersymbol interference estimate is precomputed for
each of a plurality of postcursor taps of the channel impulse response, based on each possible
value for a data symbol. Thereafter, a partial intersymbol interference estimate is selected for
each of the plurality of postcursor taps other than a first postcursor tap based on a past decision
from a corresponding state. The past decision may be a past ACS decision of the ACSU or a past
survivor symbol in the SMU or a combination of both. A precomputed partial intersymbol
interference estimate for the first postcursor tap is referred to as a precomputed intersymbol
interference estimate. In addition, speculative branch metrics are precomputed based on the
precomputed intersymbol interference estimates. One of the precomputed branch metrics is
selected based on a past decision from a corresponding state. The past decision may be a past
ACS decision of the ACSU or a past survivor symbol in the SMU or a combination of both. The
selected branch metric is used to compute new path metrics for path extensions from a
corresponding state. The computed new path metrics are used to determine the best survivor
path and path metric for a corresponding state.

In further variations, intersymbol interference estimates can be selected among precomputed
intersymbol interference estimates without precomputing branch metrics, or the partial
intersymbol interference estimates can be precomputed for a group of taps, with a
precomputation for all possible data symbol combinations corresponding to the groups of taps
and selection for each group.

A more complete understanding of the present invention, as well as further
features and advantages of the present invention, will be obtained by reference to the following
detailed description and drawings.


Brief Description of the Drawings 

FIG. 1 illustrates a channel impulse response with channel memory, L;

FIG. 2 illustrates a communication system in which the present invention may
operate;

FIG. 3 illustrates a trellis associated with a channel of memory length L=1 and
binary data symbols;

FIG. 4 illustrates a block diagram for an implementation of the Viterbi algorithm
(VA);

FIG. 5 illustrates a state-parallel implementation of the ACSU of FIG. 4 for a
channel of memory L=1;

FIG. 6 is a table analyzing the complexity and critical path of MLSE and RSSE
techniques;

FIG. 7A illustrates the architecture of a reduced state sequence estimator;

FIG. 7B illustrates an implementation of the look-up tables in the BMU of FIG.
7A;

FIG. 8 illustrates an exemplary look-ahead architecture for an RSSE algorithm
with one-step look-ahead in accordance with one embodiment of the present invention;

FIG. 9 illustrates an exemplary look-ahead architecture for an RSSE algorithm
with multiple-step look-ahead in accordance with another embodiment of the present invention;

FIG. 10 is a table analyzing the complexity and critical path of a pipelined RSSE
in accordance with the present invention;

FIGS. 11 and 12 illustrate alternate implementations of the RSSE algorithms with
one-step look-ahead (FIG. 8) and multiple-step look-ahead (FIG. 9), respectively;

FIG. 13 illustrates a trellis for a multi-dimensional trellis code, such as the
1000BASE-T trellis code;

FIG. 14 is a schematic block diagram illustrating a pipelined parallel decision
feedback decoder (PDFD) architecture that decodes the 1000BASE-T trellis code and equalizes
intersymbol interference in accordance with the present invention;

FIG. 15 is a schematic block diagram illustrating an embodiment of the
look-ahead decision feedback unit (LA-DFU) of FIG. 14;

FIG. 16 is a schematic block diagram illustrating an embodiment of the
intersymbol interference selection unit (ISI-MUXU) of FIG. 14;

FIG. 17 is a schematic block diagram illustrating an embodiment of the
one-dimensional look-ahead branch metrics unit (1D-LA-BMU) of FIG. 14; and

FIG. 18 is a schematic block diagram illustrating an embodiment of the survivor
memory unit (SMU) of FIG. 14.

Detailed Description 

As previously indicated, the processing speed of conventional reduced state
sequence estimation (RSSE) implementations is limited by a recursive feedback loop. According
to one feature of the present invention, the processing speed of reduced state sequence estimation
implementations is improved by pipelining the branch metric and decision-feedback
computations, such that the critical path is reduced to be of the same order as in a traditional
Viterbi decoder. The additional hardware required by the present invention scales only linearly
with the look-ahead depth. The presented algorithm allows the VLSI implementation of RSSE
for high-speed applications such as Gigabit Ethernet over copper. Reduced complexity sequence
estimation techniques are disclosed for uncoded signals, where the underlying trellis has no
parallel state transitions, as well as for signals encoded with a multi-dimensional trellis code
having parallel transitions, such as signals encoded according to the 1000BASE-T Ethernet
standard. It should be understood that the disclosed pipelining technique can be applied
whenever RSSE is being used, e.g., to any kind of trellis or modulation scheme. The disclosed
examples are used for illustration purposes only and are not intended to limit the scope of the
invention.

System Model 

FIG. 1 illustrates a channel impulse response with channel memory, L. As shown
in FIG. 1, there is a main tap corresponding to time 0, and there are L postcursor taps. The first
K postcursor taps shown in FIG. 1 after the main tap are used for the construction of the
reduced-state trellis, as discussed below.

FIG. 2 illustrates a communication system 200 having a channel 210 and a
sequence estimator 220. The output of the channel 210 at time n is given by

$z_n = \sum_{i=0}^{L} f_i a_{n-i} + w_n$,  (1)

where $\{f_i\}$, $0 \le i \le L$, are the finite impulse response channel coefficients ($f_0 = 1$ is assumed
without loss of generality), L is the channel memory, $a_n$ is the data symbol at time n, and $w_n$ is
zero-mean Gaussian noise. The decision of the sequence estimator 220 corresponding to $a_n$ is
denoted by $\hat{a}_n$. The illustrative embodiment assumes that the symbols are binary, i.e.,
$a_n \in \{-1, 1\}$, and that trellis-coded modulation (TCM) is not employed. The present invention may be
applied, however, to non-binary modulation and TCM, such as the coding and modulation
scheme used in Gigabit Ethernet over copper, as would be apparent to a person of ordinary skill
in the art.
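
As an illustration of equation (1), the following minimal Python sketch generates the channel output for a short binary sequence. The function name `channel_output`, the tap values and the choice to treat symbols before time 0 as zero are assumptions made for this example only; they do not come from the disclosure.

```python
import random

def channel_output(symbols, taps, noise_std=0.0, seed=0):
    """z_n = sum_{i=0}^{L} f_i * a_{n-i} + w_n, with taps = [f_0, ..., f_L].

    Assumption for this sketch: symbols before time 0 contribute nothing.
    """
    rng = random.Random(seed)
    out = []
    for n in range(len(symbols)):
        # Main tap plus postcursor ISI from the L previous symbols.
        signal = sum(f * symbols[n - i] for i, f in enumerate(taps) if n - i >= 0)
        out.append(signal + rng.gauss(0.0, noise_std))   # w_n: zero-mean Gaussian noise
    return out

taps = [1.0, 0.5, -0.25]        # f_0 = 1, channel memory L = 2 (illustrative values)
symbols = [1, -1, -1, 1]        # binary data symbols a_n
z = channel_output(symbols, taps)
```

With `noise_std=0.0` the output contains only the symbol and its intersymbol interference, which makes the effect of the postcursor taps easy to inspect.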

The optimum method for the recovery of the transmitted symbols is MLSE, which
applies the Viterbi algorithm (VA) to the trellis defined by the channel state

$\rho_n = (a_{n-1}, a_{n-2}, \ldots, a_{n-L})$.  (2)

A binary symbol constellation is assumed. Thus, the number of states processed by the VA is
given by:

$S = 2^L$,  (3)

and two branches leave or enter each state of the trellis. FIG. 3 shows a trellis 300 associated
with a channel of memory length L=1. The branch metric for a transition from state $\rho_n$ under
input $a_n$ is given by

$\lambda_n(z_n, a_n, \rho_n) = \left(z_n - a_n - \sum_{i=1}^{L} f_i a_{n-i}\right)^2$.  (4)

The VA determines the best survivor path into state $\rho_{n+1}$ from the two
predecessor states $\{\rho_n\}$ by evaluating the following add-compare-select (ACS) function:

$\Gamma_{n+1}(\rho_{n+1}) = \min_{\{\rho_n\} \to \rho_{n+1}} \left(\Gamma_n(\rho_n) + \lambda_n(z_n, a_n, \rho_n)\right)$,  (5)

where $\Gamma_n(\rho_n)$ is the path metric for state $\rho_n$.
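
The recursion of equations (2) through (5) can be sketched in software as follows. This is a minimal illustration rather than the hardware architecture of the disclosure; the function name `viterbi_mlse` and the assumption that the channel starts from a known state of all `init_symbol` values are introduced here for the example.

```python
from itertools import product

def viterbi_mlse(z, taps, init_symbol=1):
    """MLSE for binary symbols {-1, +1}; taps = [f_0 = 1, f_1, ..., f_L]."""
    L = len(taps) - 1
    inf = float("inf")
    states = list(product((-1, 1), repeat=L))       # rho_n = (a_{n-1}, ..., a_{n-L}), eq. (2)
    start = (init_symbol,) * L                      # assumed known start state
    metric = {s: (0.0 if s == start else inf) for s in states}
    path = {s: [] for s in states}
    for zn in z:
        new_metric = {s: inf for s in states}
        new_path = dict(path)
        for s in states:
            isi = sum(f * a for f, a in zip(taps[1:], s))
            for a in (-1, 1):
                cand = metric[s] + (zn - a - isi) ** 2      # add: path metric + eq. (4)
                nxt = ((a,) + s)[:L]                        # successor state rho_{n+1}
                if cand < new_metric[nxt]:                  # compare and select, eq. (5)
                    new_metric[nxt] = cand
                    new_path[nxt] = path[s] + [a]
        metric, path = new_metric, new_path
    return path[min(states, key=metric.get)]
```

The loop over all $2^L$ states makes the exponential complexity in the channel memory explicit, which is the cost that the reduced-state techniques discussed in this description aim to avoid.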

The block diagram for an implementation of the VA is shown in FIG. 4. As
shown in FIG. 4, the VA 400 includes a branch metrics unit (BMU) 410, an add-compare-select
unit (ACSU) 420 and a survivor memory unit (SMU) 430. The BMU 410 calculates the $2^{L+1}$
branch metrics (BMs), the ACSU 420 performs the ACS operation for each of the S states, and
the SMU 430 keeps track of the S survivor paths.

The ACSU 420 is the bottleneck for maximum throughput as the operations in the
BMU 410 and SMU 430 are feedforward and can thus be pipelined using pipeline registers 415
and 425. A state-parallel implementation of the ACSU 420 yields the highest processing speed
and is shown in FIG. 5 for a channel of memory L=1 (the corresponding trellis was shown in
FIG. 3).

The recursive loop of the ACS operation associated with equation (5) determines
the critical path in the ACSU 420, as it cannot be pipelined. It can be seen from FIG. 5 that this
loop comprises one addition (ADD) 510, one 2-way comparison 520, whose delay is about the
same as one ADD, and a 2-way selection 530, corresponding to a 2-to-1 multiplexer (MUX).
Hereinafter, shift registers will not be considered in the critical path analysis due to their minor
delay. FIG. 6 is a table 600 analyzing the complexity and critical path of MLSE and RSSE.
Column 620 of table 600 summarizes the computational complexity and critical path of MLSE
for binary signals corrupted by a channel of memory L. It is noted that in addition to a
state-parallel implementation shown in FIG. 5, the throughput of the VA can be even further increased
by introducing parallelism on the bit, block and algorithmic level (for a good summary, see, e.g.,
H. Meyr, M. Moeneclaey, and S.A. Fechtel, Digital Communication Receivers, John Wiley &
Sons, pp. 568-569, 1998). However, this comes at a significant increase in complexity and/or
latency.

Reduced-State Sequence Estimation

RSSE reduces the complexity of MLSE by truncating the channel memory $\rho_n$, as
described in A. Duel-Hallen and C. Heegard, "Delayed Decision-Feedback Sequence
Estimation," IEEE Trans. Commun., vol. 37, pp. 428-436, May 1989, or applying set partitioning to
the signal alphabet as described in P. R. Chevillat and E. Eleftheriou, "Decoding of
Trellis-Encoded Signals in the Presence of Intersymbol Interference and Noise," IEEE Trans. Commun.,
vol. 37, pp. 669-676, Jul. 1989 or M.V. Eyuboglu and S.U. Qureshi, "Reduced-State Sequence
Estimation for Coded Modulation on Intersymbol Interference Channels," IEEE J. Sel. Areas
Commun., vol. 7, pp. 989-995, Aug. 1989. Similar to the VA, RSSE searches for the most likely
data sequence in the reduced trellis by keeping only the best survivor path for each reduced state.
In the exemplary embodiment discussed herein, the reduced state $\rho'_n$ is obtained by truncating
equation (2) to K yielding

$\rho'_n = (a_{n-1}, a_{n-2}, \ldots, a_{n-K}), \quad 0 \le K \le L$.  (6)

In this case, the number of reduced states is given by

$S' = 2^K$.  (7)
The results may be generalized to the cases given in P. R. Chevillat and E. Eleftheriou or M.V.
Eyuboglu and S.U. Qureshi, referenced above. The branch metric for a transition from reduced
state $\rho'_n$ under input $a_n$ is given by

$\lambda'_n(z_n, a_n, \rho'_n) = \left(z_n - a_n + u_n(\rho'_n)\right)^2$,  (8)

where

$u_n(\rho'_n) = -\sum_{i=1}^{L} f_i \hat{a}_{n-i}(\rho'_n)$.  (9)

In equation (9), $u_n(\rho'_n)$ is the decision-feedback for $\rho'_n$ and $\hat{a}_{n-i}(\rho'_n)$ is the symbol of the
survivor path into state $\rho'_n$ which corresponds to time $n-i$. As the first K survivor symbols
$(\hat{a}_{n-1}(\rho'_n), \hat{a}_{n-2}(\rho'_n), \ldots, \hat{a}_{n-K}(\rho'_n))$ from the survivor path into state $\rho'_n$ are equal to the symbols
$(a_{n-1}, a_{n-2}, \ldots, a_{n-K})$ defining this state, equation (9) can be rewritten as

$u_n(\rho'_n) = -\sum_{i=1}^{K} f_i a_{n-i} - \sum_{i=K+1}^{L} f_i \hat{a}_{n-i}(\rho'_n)$.  (10)

Among all paths entering reduced state $\rho'_{n+1}$ from the 2 predecessor states $\{\rho'_n\}$, the most likely
path with metric $\Gamma'_{n+1}(\rho'_{n+1})$ is chosen according to the ACS operation:

$\Gamma'_{n+1}(\rho'_{n+1}) = \min_{\{\rho'_n\} \to \rho'_{n+1}} \left(\Gamma'_n(\rho'_n) + \lambda'_n(z_n, a_n, \rho'_n)\right)$.  (11)

The state-parallel architecture for RSSE with the parameters L=4 and K=1 is
shown in FIG. 7A. It can be seen from FIG. 7A that the RSSE 700 architecture comprises four
functional blocks, namely, a decision feedback unit (DFU) 710, a branch metrics unit (BMU)
720, an add-compare-select unit (ACSU) 730 and a survivor memory unit (SMU) 740. As the
corresponding reduced trellis is the same as the one in FIG. 3, the ACSU 730 shown in FIG. 7A
has the same architecture as the ACSU 420 given in FIG. 5. The part of the SMU 740 that stores
the L-K survivor symbols $(\hat{a}_{n-K-1}(\rho'_n), \hat{a}_{n-K-2}(\rho'_n), \ldots, \hat{a}_{n-L}(\rho'_n))$ for each reduced state must be
implemented in a register-exchange-architecture as described in R. Cypher and C.B. Shung,
"Generalized trace-back techniques for survivor memory management in the Viterbi algorithm,"
J. VLSI Signal Processing, vol. 5, pp. 85-94, 1993, as these symbols are required for the
evaluation of equation (9) in the DFU 710 without delay. Because of the binary modulation, the
multipliers in the DFU 710 can be implemented using shifters (SHIFTs). Look-up tables (LUTs)
approximate the squaring function in equation (8) in the BMU, as defined by FIG. 7B.
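
The reduced-state recursion of equations (6) through (11) can be sketched as follows. The sketch is a software illustration under stated assumptions, not the DFU/BMU/ACSU/SMU hardware of FIG. 7A: the name `rsse_decode`, the exact squaring (in place of the LUT approximation) and the known all-`init_symbol` starting history are choices made for this example.

```python
from itertools import product

def rsse_decode(z, taps, K, init_symbol=1):
    """RSSE for binary symbols {-1, +1}; taps = [f_0 = 1, ..., f_L], 0 <= K <= L."""
    L = len(taps) - 1
    inf = float("inf")
    states = list(product((-1, 1), repeat=K))       # rho'_n = (a_{n-1}, ..., a_{n-K}), eq. (6)
    start = (init_symbol,) * K
    metric = {s: (0.0 if s == start else inf) for s in states}
    # Survivor history per state (most recent symbol first), as kept by the SMU.
    hist = {s: [init_symbol] * L for s in states}
    decided = {s: [] for s in states}
    for zn in z:
        new_metric = {s: inf for s in states}
        new_hist, new_decided = dict(hist), dict(decided)
        for s in states:
            # Decision feedback from the survivor path into this state, eq. (9).
            u = -sum(f * a for f, a in zip(taps[1:], hist[s]))
            for a in (-1, 1):
                cand = metric[s] + (zn - a + u) ** 2        # eq. (8) added to path metric
                nxt = ((a,) + s)[:K]
                if cand < new_metric[nxt]:                  # ACS operation, eq. (11)
                    new_metric[nxt] = cand
                    new_hist[nxt] = ([a] + hist[s])[:L]
                    new_decided[nxt] = decided[s] + [a]
        metric, hist, decided = new_metric, new_hist, new_decided
    return decided[min(states, key=metric.get)]
```

Note that the feedback value `u` must be recomputed from the freshly updated survivor history before the next ACS step can start; this data dependency is the software counterpart of the long recursive loop in the hardware.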
RSSE 700 has less computational complexity than MLSE for the same channel
memory L, as RSSE processes fewer states, at the expense of a significantly longer critical path. It
can be seen from FIG. 7 that there is a recursive loop which comprises one SHIFT and L-K+1
ADDs in the DFU 710 (the first term in the right hand side of equation (9) can be computed
outside the loop), one LUT in the BMU 720, one add-compare in the ACSU 730 (which is
roughly equal to two ADDs in terms of delay), and a 2-to-1 MUX in the SMU 740. All these
operations must be completed within one symbol period and cannot be pipelined. In contrast to
this, the critical path in MLSE just comprises the ACS operation. Also, due to the different
structure of the recursive loop in the RSSE 700, the block processing methods which have been
developed to speed up the VA (see H. Meyr et al., Digital Communication Receivers, John Wiley
& Sons, 568-569 (1998)) cannot be applied to increase the throughput of RSSE. Therefore, the
maximum throughput of RSSE is potentially significantly lower than that of MLSE. Furthermore, the
throughput of RSSE depends on the channel memory such that it decreases for increasing L.
FIG. 6 summarizes the comparison of MLSE and RSSE in terms of computational complexity
and critical path.
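
The operation counts stated above can be tallied with a small helper. This is only a bookkeeping sketch of the counts given in the text (it does not reproduce the table of FIG. 6), and the function names are hypothetical.

```python
def mlse_critical_path():
    """ACS loop of FIG. 5: one ADD, one compare (about one ADD), one 2-to-1 MUX."""
    return {"ADD": 2, "MUX": 1}

def rsse_critical_path(L, K):
    """Recursive loop of conventional RSSE as described in the text."""
    return {
        "SHIFT": 1,                  # DFU multiplication for binary symbols
        "ADD": (L - K + 1) + 2,      # DFU additions plus add-compare in the ACSU
        "LUT": 1,                    # squaring in the BMU
        "MUX": 1,                    # selection in the SMU
    }

example = rsse_critical_path(L=4, K=1)   # parameters of FIG. 7A
```

For L=4 and K=1 the loop contains six adder-equivalent delays against two for MLSE, and the count grows by one for every additional channel tap.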

Pipelined RSSE 

It was suggested in E.F. Haratsch and K. Azadet, "High-speed reduced-state
sequence estimation," Proc. IEEE Int. Symp. Circuits and Systems, May 2000, to precompute the
branch metrics for all possible $2^L$ channel states $\rho_n$ in a look-ahead fashion outside the critical
loop. At each decoding step, the appropriate branch metrics are chosen based on past survivor
symbols in the SMU. This approach removes the BMU and DFU from the critical loop.
However, the hardware increases exponentially with the channel memory L. Also the delay
through the MUXs, which select the actual branch metrics among all precomputed ones,
increases with L, eventually neutralizing the speed gain achieved by the precomputation. The
present invention provides a technique that increases the throughput of RSSE by performing
precomputations while only leading to a linear increase in hardware with respect to the
look-ahead depth.

One-Step Look-Ahead

The hardware increase can be limited by taking advantage of past survivor
symbols in the SMU and past decisions of the ACSU. This will be shown for precomputations
with look-ahead depth one, i.e., possible values for branch metrics needed by the ACSU at time n
are already computed at time n-1.

A partial decision-feedback for reduced state $\rho'_{n-1}$ could be calculated by using the
L-1 survivor symbols $(\hat{a}_{n-2}(\rho'_{n-1}), \hat{a}_{n-3}(\rho'_{n-1}), \ldots, \hat{a}_{n-L}(\rho'_{n-1}))$ corresponding to the survivor sequence
into $\rho'_{n-1}$:

$v_n(\rho'_{n-1}) = -\sum_{i=2}^{L} f_i \hat{a}_{n-i}(\rho'_{n-1})$.  (12)

Note that the K survivor symbols $(\hat{a}_{n-2}(\rho'_{n-1}), \hat{a}_{n-3}(\rho'_{n-1}), \ldots, \hat{a}_{n-K-1}(\rho'_{n-1}))$ need not be fed back
from the SMU, as they are equal to the symbols defining the state $\rho'_{n-1}$ (cf. equation (6)).
Therefore, these symbols and their contribution to the partial decision-feedback $v_n(\rho'_{n-1})$ are
fixed for a particular state $\rho'_{n-1}$.

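If $\tilde{a}_{n-1}$ denotes a possible extension of the sequence $(\hat{a}_{n-2}(\rho'_{n-1}), \hat{a}_{n-3}(\rho'_{n-1}), \ldots, \hat{a}_{n-L}(\rho'_{n-1}))$, the
corresponding tentative decision-feedback is given by

$\tilde{u}_n(\rho'_{n-1}, \tilde{a}_{n-1}) = v_n(\rho'_{n-1}) - f_1 \tilde{a}_{n-1}$,  (13)

and the tentative branch metric under input $a_n$ is

$\tilde{\lambda}_n(z_n, a_n, \rho'_{n-1}, \tilde{a}_{n-1}) = \left(z_n - a_n + \tilde{u}_n(\rho'_{n-1}, \tilde{a}_{n-1})\right)^2$.  (14)

The actual branch metric corresponding to survivor paths into state $\rho'_n$ and input $a_n$ can be
selected among the tentative branch metrics based on the past decision $d_{n-1}(\rho'_n)$ according
to

$\lambda'_n(z_n, a_n, \rho'_n) = \mathrm{sel}\{\Lambda_n(z_n, a_n, \rho'_n), d_{n-1}(\rho'_n)\}$,  (15)

where $\Lambda_n(z_n, a_n, \rho'_n)$ is the vector containing the two tentative branch metrics $\tilde{\lambda}_n(z_n, a_n, \rho'_{n-1}, \tilde{a}_{n-1})$
for input $a_n$ and the two possible sequences into $\rho'_n$ from the different predecessor states $\{\rho'_{n-1}\}$:

$\Lambda_n(z_n, a_n, \rho'_n) = \{\tilde{\lambda}_n(z_n, a_n, \rho'_{n-1}, \tilde{a}_{n-1})\}, \quad \{\rho'_{n-1}\} \to \rho'_n$.  (16)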

The branch metrics, which have been selected using equation (15), are used for the ACS
operation according to equation (11). As equations (12), (13), (14), (15) and (16) can already be
evaluated at time n-1, they are decoupled from the ACS operation according to equation (11) at
time n. This leads to an architecture that can achieve a potentially higher throughput than the
conventional RSSE implementation. The look-ahead architecture for the RSSE 700 of FIG. 7
(i.e., L=4 and K=1) is shown in FIG. 8. It can be seen that the long critical path in the architecture
of FIG. 7 is broken up into two smaller critical paths, as a pipeline stage 825 is placed in front of
the ACSU 830. The processing speed of this architecture still depends on the channel memory,
as the number of additions and thus the delay along the critical path in the DFU 810 increases
with L. In the following, a pipelined RSSE architecture is discussed whose maximum
throughput does not depend on L.
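
The one-step look-ahead idea can be checked numerically for one target state with L=4 and K=1: tentative branch metrics are precomputed for both predecessor states, and the one chosen by the past ACS decision matches the metric computed directly from the full survivor path. The helper names, tap values, survivor symbols and the received value 0.3 below are hypothetical and serve only this illustration.

```python
def partial_feedback(taps, survivor):
    """v_n for one predecessor state from its symbols (a_{n-2}, ..., a_{n-L}), eq. (12)."""
    return -sum(f * a for f, a in zip(taps[2:], survivor))

def tentative_metrics(zn, an, taps, survivors, a_prev):
    """Tentative branch metrics for every predecessor state, eqs. (13) and (14)."""
    metrics = {}
    for pred, surv in survivors.items():
        u_tent = partial_feedback(taps, surv) - taps[1] * a_prev    # eq. (13)
        metrics[pred] = (zn - an + u_tent) ** 2                     # eq. (14)
    return metrics

def direct_metric(zn, an, taps, full_survivor):
    """Conventional branch metric from the complete survivor path, eqs. (8) and (9)."""
    u = -sum(f * a for f, a in zip(taps[1:], full_survivor))
    return (zn - an + u) ** 2

taps = [1.0, 0.5, -0.25, 0.1, 0.05]            # f_0..f_4, illustrative values
survivors = {-1: [-1, 1, 1], 1: [1, -1, 1]}    # per predecessor: (a_{n-2}, a_{n-3}, a_{n-4})
a_prev = 1                                     # a_{n-1}, fixed by the target state for K=1
tentative = tentative_metrics(0.3, -1, taps, survivors, a_prev)   # available at time n-1
winner = 1                                     # past ACS decision: winning predecessor
selected = tentative[winner]                   # selection step, eq. (15)
reference = direct_metric(0.3, -1, taps, [a_prev] + survivors[winner])
```

Here `selected` and `reference` agree up to floating-point rounding, while the expensive part of the computation no longer depends on the most recent decision.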

Multiple-Step Look-Ahead 

The process of precomputing branch metrics which are needed at time n could
already be started at time n-M, where $M \in [1; L-K]$. A partial decision-feedback corresponding
to the survivor sequence $(\hat{a}_{n-M-1}(\rho'_{n-M}), \hat{a}_{n-M-2}(\rho'_{n-M}), \ldots, \hat{a}_{n-L}(\rho'_{n-M}))$ into $\rho'_{n-M}$ is given by

$v_n(\rho'_{n-M}) = -\sum_{i=M+1}^{L} f_i \hat{a}_{n-i}(\rho'_{n-M})$.  (17)

It is again noted that the K survivor symbols $(\hat{a}_{n-M-1}(\rho'_{n-M}), \hat{a}_{n-M-2}(\rho'_{n-M}), \ldots, \hat{a}_{n-M-K}(\rho'_{n-M}))$ are
identical to the symbols defining the state $\rho'_{n-M}$, and thus their contribution to $v_n(\rho'_{n-M})$ is fixed
for this particular state. A tentative partial decision-feedback for a sequence starting with
$(\hat{a}_{n-M-1}(\rho'_{n-M}), \hat{a}_{n-M-2}(\rho'_{n-M}), \ldots, \hat{a}_{n-L}(\rho'_{n-M}))$ and which is extended by $\tilde{a}_{n-M}$ can be precomputed
as

$\tilde{u}_n(\rho'_{n-M}, \tilde{a}_{n-M}) = v_n(\rho'_{n-M}) - f_M \tilde{a}_{n-M}$.  (18)

When the decision $d_{n-M}(\rho'_{n-M+1})$ becomes available, the partial decision-feedback, which
corresponds to the survivor sequence $(\hat{a}_{n-M}(\rho'_{n-M+1}), \hat{a}_{n-M-1}(\rho'_{n-M+1}), \ldots, \hat{a}_{n-L}(\rho'_{n-M+1}))$, can be
selected among the precomputed ones:

$v_n(\rho'_{n-M+1}) = \mathrm{sel}\{U_n(\rho'_{n-M+1}), d_{n-M}(\rho'_{n-M+1})\}$,  (19)

where $U_n(\rho'_{n-M+1})$ is the vector containing the two precomputed tentative partial
decision-feedback values for the two possible path extensions into $\rho'_{n-M+1}$ from the different predecessor
states $\{\rho'_{n-M}\}$:

$U_n(\rho'_{n-M+1}) = \{\tilde{u}_n(\rho'_{n-M}, \tilde{a}_{n-M})\}, \quad \{\rho'_{n-M}\} \to \rho'_{n-M+1}$.  (20)

To be able to eventually precompute tentative branch metrics according to equation (14), the
computations described by equations (18), (19) and (20) must be repeated for time steps
n-M+1 to n-1 according to the following equations, where $M-1 \ge k \ge 1$:

$\tilde{u}_n(\rho'_{n-k}, \tilde{a}_{n-k}) = v_n(\rho'_{n-k}) - f_k \tilde{a}_{n-k}$,  (21)

$v_n(\rho'_{n-k+1}) = \mathrm{sel}\{U_n(\rho'_{n-k+1}), d_{n-k}(\rho'_{n-k+1})\}$,  (22)

$U_n(\rho'_{n-k+1}) = \{\tilde{u}_n(\rho'_{n-k}, \tilde{a}_{n-k})\}, \quad \{\rho'_{n-k}\} \to \rho'_{n-k+1}$.  (23)

Once $\tilde{u}_n(\rho'_{n-1}, \tilde{a}_{n-1})$ becomes available, tentative branch metrics $\tilde{\lambda}_n(z_n, a_n, \rho'_{n-1}, \tilde{a}_{n-1})$ can be
precomputed according to equation (14) and the appropriate branch metrics are selected
according to equations (15) and (16).
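
The staged computation of equations (17) through (23) can be sketched for the binary case with K=1, where each reduced state is a single symbol and the next state equals the extension symbol. The function name `lookahead_feedback`, the dictionary layout and the example decisions are assumptions for illustration; in hardware each loop iteration corresponds to one pipeline stage.

```python
def lookahead_feedback(taps, M, survivors, decisions):
    """Tentative decision-feedback u_n(rho'_{n-1}, a_{n-1}) built in M stages (K = 1).

    survivors[s]: symbols (a_{n-M-1}, ..., a_{n-L}) of the survivor into state s at time n-M.
    decisions[j][s_next]: predecessor state chosen by the ACS decision for s_next at stage j.
    """
    states = (-1, 1)
    # Eq. (17): contribution of taps M+1..L, known at time n-M.
    v = {s: -sum(f * a for f, a in zip(taps[M + 1:], survivors[s])) for s in states}
    for k in range(M, 0, -1):
        # Eqs. (18)/(21): extend every state by every possible symbol.
        u = {(s, a): v[s] - taps[k] * a for s in states for a in states}
        if k == 1:
            return u
        d = decisions[M - k]
        # Eqs. (19)/(22): keep the value belonging to the winning predecessor.
        v = {s_next: u[(d[s_next], s_next)] for s_next in states}

taps = [1.0, 0.5, -0.25, 0.1, 0.05]                 # L = 4, illustrative values
survivors = {-1: [-1], 1: [1]}                      # for M = 3 only a_{n-4} is needed
decisions = [{-1: 1, 1: 1}, {-1: -1, 1: 1}]         # hypothetical ACS decisions
u = lookahead_feedback(taps, M=3, survivors=survivors, decisions=decisions)
```

Each stage performs a single subtraction per tentative value followed by a two-way selection, so the work per stage does not grow with the channel memory, and the number of stages grows only linearly with M.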

The architecture for RSSE 900 with look-ahead depth M=3 and the parameters
L=4 and K=1 is shown in FIG. 9. It can be seen that in total M=3 pipeline stages are available.
Two pipeline stages 912, 916 have been placed inside the DFU 910, and one pipeline stage 925
has been placed between the BMU 920 and ACSU 930. The connection network in the DFU 910
resembles the structure of the underlying trellis from FIG. 3, as past decisions from the ACSU
930 are used to extend the partial survivor sequences by the subsequent survivor symbol. As the
LUT typically has a delay comparable to an adder, the critical path of this implementation is
determined by an add-compare in the ACSU 930 (2 ADDs) and the storage of the most recent
decision in the SMU 940 or the selection of an appropriate value with a 2-to-1 MUX in the DFU
910 or BMU 920.


The complexity and critical path of pipelined RSSE using multiple-step
look-ahead computations are shown in FIG. 10. It can be seen in FIG. 10 that the hardware overhead
for performing precomputations scales only linearly with the look-ahead depth M. Choosing
M=L-K as in FIG. 9 leads to an architecture where the critical path is reduced to be of the same
order as in MLSE (cf. FIG. 6) and does not depend on the channel memory L. In a further
variation, the precomputed partial ISI estimates in the pipelined DFU 910 may be processed in
groups of taps, with a precomputation for all possible data symbol combinations corresponding
to the groups of taps and selection for each group, as would be apparent to a person of ordinary
skill in the art.

FIGS. 11 and 12 illustrate alternate implementations of the RSSEs shown in
FIGS. 8 and 9, respectively, where pipeline registers are placed differently (now before the MUX at
stage 1125 and 1225, respectively) using a cut-set or re-timing transformation technique. For a
more detailed discussion of cut-set or re-timing transformation techniques, see P. Pirsch,
Architectures for Digital Signal Processing, New York, Wiley (1998), incorporated by reference
herein. The present invention encompasses all derivations that can be achieved using such
transformations, as would be apparent to a person of ordinary skill in the art.

JOINT POSTCURSOR EQUALIZATION AND TRELLIS DECODING FOR 1000BASE-T
GIGABIT ETHERNET

An exemplary embodiment employs the 1000BASE-T physical layer standard that
specifies Gigabit Ethernet over four pairs of Category 5 unshielded twisted pair (UTP) copper
cabling, as described in M. Hatamian et al., "Design Considerations for Gigabit Ethernet
1000Base-T Twisted Pair Transceivers," Proc. IEEE Custom Integrated Circuits Conf. (CICC),
Santa Clara, CA, 335-342 (May 1998); or K. Azadet, "Gigabit Ethernet Over Unshielded
Twisted Pair Cables," Proc. Int. Symp. VLSI Technology, Systems, Applications (VLSI-TSA),
Taipei, 167-170 (June 1999). It is noted that hereinafter, all variables are newly defined.
Although the meaning of variables used in this second part of the detailed description may
be related to the definition of the variables in the previous part of the description, they might not
have exactly the same meaning. All variables used hereinafter, however, will be described and
defined in a precise way and their meaning is valid for this second part of the detailed description
only.

The throughput of 1 Gb/s is achieved in 1000BASE-T by full duplex transmission
of pulse amplitude modulated signals with the five levels {-2, -1, 0, 1, 2} (PAM5) resulting in a
data rate of 250 Mb/s per wire pair. By grouping four PAM5 symbols transmitted over the four
different wire channels, a four-dimensional (4D) symbol is formed which carries eight
information bits.

Thus, the symbol rate is 125 Mbaud, which corresponds to a symbol period of
8 ns. To achieve a target bit error rate of less than $10^{-10}$, the digital signal processor (DSP)
section of a 1000BASE-T receiver must cancel intersymbol interference (ISI), echo and near-end
crosstalk (NEXT). 1000BASE-T improves the noise margin by employing trellis-coded
modulation (TCM). For a detailed discussion of trellis-coded modulation techniques, see, for
example, G. Ungerboeck, "Trellis-Coded Modulation With Redundant Signal Sets, Parts I and
II," IEEE Commun. Mag., Vol. 25, 5-21 (Feb. 1987), incorporated by reference herein.
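
The rate figures quoted above follow from simple arithmetic, which can be reproduced as a quick check. The variable names are chosen for this example only.

```python
bits_per_4d_symbol = 8          # information bits carried by one 4D symbol
symbol_rate_baud = 125e6        # 125 Mbaud
wire_pairs = 4

total_rate_bps = bits_per_4d_symbol * symbol_rate_baud      # aggregate throughput
rate_per_pair_bps = total_rate_bps / wire_pairs             # data rate per wire pair
symbol_period_ns = 1e9 / symbol_rate_baud                   # time budget per decoding step
```

The 8 ns symbol period is the time within which any non-pipelined recursive loop of the decoder has to complete.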

For coding purposes, the PAM5 symbols are partitioned into two one-dimensional
(1D) subsets A={-1,1} and B={-2,0,2}. By grouping different combinations of the
1D subsets together which are transmitted over the four wire pairs, the eight 4D subsets S0,
S1, ..., S7 are formed. The 8-state, radix-4 code trellis specified by 1000BASE-T is shown in
FIG. 13. $\rho_n$ in FIG. 13 denotes the state of the trellis code at time n (i.e., $\rho_n$ is no longer
defined by equation (2), as noted at the beginning of this section). Each transition in the trellis
diagram 1300 corresponds to one of the specified eight 4D subsets. There are 64 parallel
transitions per state transition. Due to the 4D subset partitioning and labeling of the transitions in
the code trellis, the minimum squared Euclidean distance between allowed sequences is
$\Delta^2 = 4$, which corresponds to an asymptotic coding gain of 6 dB ($10 \log_{10} 4$) over uncoded
PAM5 in an ISI free channel.
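
As a rough illustration of the numbers above, the following sketch checks the level spacing
inside the two 1D subsets and the arithmetic behind the 6 dB figure. It does not compute the
free distance of the trellis code; the value 4 for the coded sequences is taken from the text,
and the helper name `min_sq_distance` is an assumption made for this example.

```python
import math
from itertools import combinations

PAM5 = [-2, -1, 0, 1, 2]
SUBSET_A = [-1, 1]
SUBSET_B = [-2, 0, 2]

def min_sq_distance(levels):
    """Smallest squared distance between two distinct levels."""
    return min((x - y) ** 2 for x, y in combinations(levels, 2))

d2_uncoded = min_sq_distance(PAM5)       # 1: adjacent PAM5 levels differ by 1
d2_within_a = min_sq_distance(SUBSET_A)  # 4: levels inside subset A differ by 2
d2_within_b = min_sq_distance(SUBSET_B)  # 4: levels inside subset B differ by 2

# Ratio of the squared distance stated in the text (4) to the uncoded one, in dB.
gain_db = 10 * math.log10(4 / d2_uncoded)
print(round(gain_db, 2))
```

The partitioning doubles the spacing within each subset, which is consistent with the squared
distance of 4 stated for the coded sequences.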

In a 1000BASE-T receiver, feedforward equalizers, echo and NEXT cancellers
remove precursor ISI, echo and NEXT respectively. The remaining DSP processing removes the
postcursor ISI, which typically spans 14 symbol periods, and decodes the trellis code. It has been
shown in E.F. Haratsch, "High-Speed VLSI Implementation of Reduced Complexity Sequence
Estimation Algorithms With Application to Gigabit Ethernet 1000BASE-T," Proc. Int. Symp.
VLSI Technology, Systems, Applications (VLSI-TSA), Taipei, 171-174 (June 1999) that parallel
decision-feedback decoding, a special case of reduced-state sequence estimation (see M.V. Eyuboglu
and S.U. Qureshi, "Reduced-State Sequence Estimation for Coded Modulation on Intersymbol
Interference Channels," IEEE J. Sel. Areas Commun., Vol. 7, 989-95 (Aug. 1989)), offers the best
trade-off for this task with respect to SNR performance, hardware complexity and critical path. 
However, the integration of a 125MHz, 14-tap parallel decision-feedback decoder (PDFD) is 
quite challenging because of the critical path problem. 

A simplified postcursor equalization and trellis decoding structure was presented 
in E.F. Haratsch and K. Azadet, "A Low Complexity Joint Equalizer and Decoder for 
1000BASE-T Gigabit Ethernet," Proc. IEEE Custom Integrated Circuits Conf. (CICC), Orlando,
465-68 (May 2000), where decision-feedback prefilters shorten the postcursor impulse response
to one postcursor. Then exhaustive precomputation of all possible 1D branch metrics is possible,
substantially reducing the critical path of the remaining 1-tap PDFD. However, the postcursor
equalization and trellis decoding structure suffers from a performance degradation of 1.3dB
compared to a 14-tap PDFD.

The present invention thus provides a pipelined 14-tap PDFD architecture, which 
operates at the required processing speed of 125MHz without any coding gain loss. To achieve 
this, the look-ahead technique discussed above for uncoded signals impaired by ISI, where the 
underlying trellis has no parallel state transitions, is extended to trellis codes with parallel 
transitions like the one specified in 1000BASE-T. The processing blocks of the disclosed
architecture, which differ from a conventional PDFD design, are described below. 

Parallel Decision-Feedback Decoding Algorithm 

Parallel decision-feedback decoding combines postcursor equalization with TCM
decoding by computing separate ISI estimates for each code state before applying the well known
Viterbi algorithm (see, e.g., G. D. Forney, Jr., "The Viterbi Algorithm," Proc. IEEE, Vol. 61,
268-78 (Mar. 1973)) to decode the trellis code. An ISI estimate for wire pair j and code state $\rho_n$
at time n is given by

$$u_{n,j}(\rho_n) = -\sum_{i=1}^{14} f_{i,j}\,\hat{a}_{n-i,j}(\rho_n),$$

where $\{f_{i,j}\}$ are the postcursor channel coefficients for wire pair j and $\hat{a}_{n-i,j}(\rho_n)$ is the j-th
dimension of the 4D survivor symbol
$\hat{a}_{n-i}(\rho_n) = (\hat{a}_{n-i,1}(\rho_n), \hat{a}_{n-i,2}(\rho_n), \hat{a}_{n-i,3}(\rho_n), \hat{a}_{n-i,4}(\rho_n))$ which
belongs to the survivor sequence into $\rho_n$ and corresponds to time n-i. As there are eight code
states and four wire pairs, 32 ISI estimates are calculated at each decoding step. In a
straightforward PDFD implementation, the calculation of the ISI estimates in the decision-feedback
unit introduces a recursive loop, which also includes the branch metric unit (BMU),
add-compare-select unit (ACSU) and survivor memory unit (SMU). As the clock rate is 125MHz in
1000BASE-T, there are only 8ns available for the operations along this critical path. As
conventional pipelining techniques cannot be applied to improve throughput due to the recursive
nonlinear structure of this loop, it is extremely challenging to implement a 125MHz, 14-tap
PDFD for 1000BASE-T Gigabit Ethernet. When a state-parallel 14-tap PDFD is implemented
using VHDL and synthesized in a 3.3V 0.16µm standard cell CMOS process, the design
achieves a throughput of only approximately 500Mb/s, and the hardware complexity is
158kGates. In the following, a pipelined 14-tap PDFD architecture is disclosed that achieves the
required throughput of 1Gb/s without sacrificing coding gain performance.
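
The per-state ISI computation described above can be sketched as follows. This is a minimal
software illustration, not the disclosed hardware; the function name `isi_estimates` and the
list layouts of `f` and `survivors` are assumptions made for the example.

```python
NUM_STATES = 8   # code states
NUM_PAIRS = 4    # wire pairs
NUM_TAPS = 14    # postcursor taps

def isi_estimates(f, survivors):
    """Return u[state][pair] = -sum over i=1..14 of f_{i,j} * a_{n-i,j}(state).

    f[j][i-1]         : postcursor coefficient f_{i,j} (assumed layout)
    survivors[s][i-1] : 4D survivor symbol at time n-i on the path into state s,
                        given as a 4-tuple of PAM5 levels (assumed layout)
    """
    return [
        [
            -sum(f[j][i] * survivors[s][i][j] for i in range(NUM_TAPS))
            for j in range(NUM_PAIRS)
        ]
        for s in range(NUM_STATES)
    ]

# Example: only the first tap is non-zero, every survivor symbol is (1, -1, 2, 0).
f = [[0.5] + [0.0] * 13 for _ in range(NUM_PAIRS)]
survivors = [[(1, -1, 2, 0)] * NUM_TAPS for _ in range(NUM_STATES)]
u = isi_estimates(f, survivors)
print(sum(len(row) for row in u))  # 32 estimates per decoding step
```

Because every estimate depends on the survivor symbols that the ACSU and SMU have just
produced, this computation sits inside the recursive loop in a straightforward implementation.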

Pipelined 14-Tap PDFD Architecture

The parallel decision-feedback decoding algorithm was reformulated above such 
that pipelining of the computation of the ISI estimates and branch metrics is possible. ISI 
estimates and branch metrics are precomputed in a look-ahead fashion to bring the DFU and 
BMU out of the critical loop (see FIGS. 8 and 9 and corresponding discussion). Using ACS 
decisions to prune the look-ahead computation tree mitigates the exponential growth of the
computational complexity with respect to the look-ahead depth. The above discussion only 
addressed the case where parallel decision-feedback decoding or other RSSE variants are used 
for equalization (and trellis decoding) of signals impaired by ISI, where the underlying trellis has 
no parallel state transitions. 

In the following discussion, the look-ahead computation concept discussed above 
is extended to systems where the parallel decision-feedback decoding algorithm or other RSSE 
variants are used for equalization and/or trellis decoding where the underlying trellis has parallel 
state transitions. In particular, an exemplary pipelined, 14-tap PDFD architecture with
look-ahead depth two is presented which meets the throughput requirement of 1000BASE-T. The
present invention can be generalized to other look-ahead depths, trellis codes, modulation 
schemes, RSSE variants and number of postcursor taps, as would be apparent to a person of 
ordinary skill in the art. 

The disclosed pipelined PDFD architecture, which decodes the 1000BASE-T
trellis code and equalizes the ISI due to 14 postcursors, is shown in FIG. 14. Speculative ISI
estimates which are used for the ACS decisions corresponding to state transitions
$\{\rho_{n+2}\} \to \{\rho_{n+3}\}$ are computed in the look-ahead DFU (LA-DFU) 1412 using information already
available at time n, i.e., two clock cycles ahead of time. Therefore, the look-ahead depth is two.
The appropriate ISI estimates are selected in the ISI-multiplexer unit (ISI-MUXU) 1416 based on
ACS decisions (from 1440) and survivor symbols (from 1450). Speculative 1D branch metrics are
precomputed one decoding step in advance in the 1D-LA-BMU 1424. Again, ACS decisions and
survivor symbols are used to select the appropriate 1D branch metrics in the 1D-BM-MUXU 1428.
The selected 1D branch metrics are added up in the 4D-BMU 1430 to compute the 4D branch
metrics, which correspond to state transitions of the code trellis 1300 shown in FIG. 13. The best
survivor path for each code state is determined in the ACSU 1440, and the eight survivor paths
are stored in the SMU 1450.

Compared to a conventional PDFD implementation as described in E.F. Haratsch,
"High-Speed VLSI Implementation of Reduced Complexity Sequence Estimation Algorithms
With Application to Gigabit Ethernet 1000BASE-T," Proc. Int. Symp. VLSI Technology,
Systems, Applications (VLSI-TSA), Taipei, 171-174 (June 1999), the DFU and 1D-BMU are
outside the critical loop, as there is a pipeline stage 1418 between the DFU and 1D-BMU and
another pipeline stage 1429 between the 1D-BMU and 4D-BMU. The critical path in the
architecture of FIG. 14 includes only the 4D-BMU 1430, ACSU 1440 and SMU 1450. The
contribution of the 1D-BM-MUXU 1428 and ISI-MUXU 1416 to the critical path is low. Therefore,
the proposed PDFD architecture achieves a throughput twice as high as a conventional PDFD
implementation. The proposed PDFD architecture differs from the pipelined structure developed
for trellises without parallel state transitions in FIGS. 8 and 9 with respect to the selection of
the appropriate ISI estimates and 1D branch metrics in the ISI-MUXU 1416 and 1D-BM-MUXU 1428.
As 1000BASE-T employs TCM with parallel state transitions, not only ACS decisions, but also the
most recent survivor symbols are required for the selection of the appropriate values as there is
not a unique relationship between ACS decisions and survivor symbols. In the following, the
implementations of the DFU 1410, 1D-BMU 1420 and SMU 1450 are described in detail. The
implementation of the 4D-BMU 1430 and ACSU 1440 is the same as in a conventional PDFD
and is already described in E.F. Haratsch and K. Azadet, "A Low Complexity Joint Equalizer
and Decoder for 1000BASE-T Gigabit Ethernet," Proc. IEEE Custom Integrated Circuits Conf.
(CICC), Orlando, 465-468 (May 2000).

Decision-Feedback Unit

Exhaustive precomputation of ISI estimates is not feasible in 1000BASE-T
without prefiltering as the number of possible ISI estimates grows exponentially with the number
of postcursors. As there are 14 postcursors and four wire pairs, and PAM5 modulation is used,
there are in total $4 \times 5^{14} \approx 2 \times 10^{10}$ possible ISI estimates, which would have to be
precomputed. Precomputing ISI estimates using a limited look-ahead depth reduces the complexity.
The exponential growth of the number of precomputed ISI estimates is mitigated as the
precomputation is not completely decoupled from the ACS and survivor symbol decisions.
Survivor symbols available at time n are used for the computation of ISI estimates corresponding
to state transitions $\{\rho_{n+2}\} \to \{\rho_{n+3}\}$. Then, the look-ahead computation tree is pruned using ACS
and survivor symbol decisions available at time n.
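
The size of the exhaustive precomputation can be checked with a few lines of arithmetic. The
comparison value of 8x4x5 is the count given for the LA-DFU elsewhere in this description; the
variable names are illustrative.

```python
PAIRS, LEVELS, TAPS, STATES = 4, 5, 14, 8

# Exhaustive precomputation: every combination of 14 PAM5 symbols on each wire pair.
exhaustive = PAIRS * LEVELS ** TAPS   # 24414062500, i.e. about 2 x 10^10

# Limited look-ahead: one speculative PAM5 symbol per code state and wire pair.
look_ahead = STATES * PAIRS * LEVELS  # 160

print(exhaustive, look_ahead)
```

The look-ahead scheme stays small because all but one of the symbols entering each estimate
are taken from the survivor paths rather than enumerated.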

Look-Ahead Computation of ISI Estimates (LA-DFU)

An estimate $v_{n+2,j}(\rho_n)$ for the partial ISI due to the channel coefficients
$\{f_{3,j}, f_{4,j}, \ldots, f_{14,j}\}$ which corresponds to a state transition $\rho_{n+2} \to \rho_{n+3}$ can be calculated
by using the symbols from the survivor path into state $\rho_n$ which are available at time n:

$$v_{n+2,j}(\rho_n) = -\sum_{i=3}^{14} f_{i,j}\,\hat{a}_{n+2-i,j}(\rho_n).$$

A speculative partial ISI estimate $\tilde{u}_{n+2,j}(\rho_n, \tilde{a}_{n,j})$, which also considers the ISI due
to $f_{2,j}$ and assumes that $\tilde{a}_{n,j}$ is the 1D symbol for the corresponding transition
$\rho_n \to \rho_{n+1}$, is calculated as

$$\tilde{u}_{n+2,j}(\rho_n, \tilde{a}_{n,j}) = v_{n+2,j}(\rho_n) - f_{2,j}\,\tilde{a}_{n,j}.$$

As there are five possibilities for $\tilde{a}_{n,j}$ due to the PAM5 modulation, five different
partial ISI estimates must be computed per code state and wire pair in the LA-DFU 1412 as
shown in FIG. 15. In total, 160 (8x4x5) such ISI estimates are precomputed in the LA-DFU
1412.
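
A minimal sketch of the look-ahead computation for a single code state and wire pair is given
below. It follows the two-step description above (taps 3 to 14 from the survivor path, then
the tap-2 term for each candidate symbol); the function name `la_dfu` and the dictionary
layouts are assumptions made for the example, not part of the disclosed circuit.

```python
PAM5 = (-2, -1, 0, 1, 2)

def la_dfu(f_j, past_j):
    """Speculative partial ISI estimates for one code state and one wire pair.

    f_j    : dict mapping tap index i (1..14) to coefficient f_{i,j}
    past_j : dict mapping delay k (1..12) to the survivor symbol at time n-k
             on this wire pair (assumed layout)
    Returns a dict mapping each candidate symbol for time n to its estimate.
    """
    # Partial estimate: tap i pairs with the survivor symbol at time n+2-i = n-(i-2).
    v = -sum(f_j[i] * past_j[i - 2] for i in range(3, 15))
    # Add the tap-2 term once for each of the five possible symbols at time n.
    return {a: v - f_j[2] * a for a in PAM5}

# Example with two non-zero taps and one non-zero survivor symbol.
f_j = {i: 0.0 for i in range(1, 15)}
f_j[2], f_j[3] = 0.5, 0.25
past_j = {k: 0 for k in range(1, 13)}
past_j[1] = 2
spec = la_dfu(f_j, past_j)
print(spec[0], len(spec))
```

Calling this once per code state and wire pair yields the 8x4x5 speculative estimates
mentioned in the text.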

Selection of ISI Estimates (ISI-MUXU) 

The appropriate partial ISI estimate $v_{n+2,j}(\rho_{n+1})$ which considers the symbols from
the survivor path into $\rho_{n+1}$ and the channel coefficients $\{f_{2,j}, f_{3,j}, \ldots, f_{14,j}\}$ can be selected
among the precomputed partial ISI estimates $\tilde{u}_{n+2,j}(\rho_n, \tilde{a}_{n,j})$ when the best survivor path into
state $\rho_{n+1}$ and the corresponding 4D survivor symbol $\hat{a}_n(\rho_{n+1})$ become available. This selection
in the ISI-MUXU 1416 is shown in FIG. 16 for a particular wire pair j and state $\rho_{n+1} = 0$. The
partial ISI estimate $v_{n+2,j}(\rho_{n+1} = 0)$ is selected among 20 (4x5) precomputed partial ISI estimates
$\{\tilde{u}_{n+2,j}(\rho_n, \tilde{a}_{n,j})\}$, $\{\rho_n\} \to \rho_{n+1} = 0$, as there are four contender paths from the states
$\rho_n = 0, 2, 4$ and 6 leading into state $\rho_{n+1} = 0$. Also, for each of these contender paths leading
into state $\rho_{n+1}$, five different partial ISI estimates $\tilde{u}_{n+2,j}(\rho_n, \tilde{a}_{n,j})$ corresponding to different
values for $\tilde{a}_{n,j}$ are possible. As shown in FIG. 16, the selection of the appropriate partial ISI
estimate $v_{n+2,j}(\rho_{n+1})$ is performed in two stages. First, the ACS decision $d_n(\rho_{n+1})$ selects the five
speculative partial ISI estimates, which correspond to the selected survivor path into $\rho_{n+1}$, but
assume different values for $\tilde{a}_{n,j}$. Then, the survivor symbol $\hat{a}_{n,j}(\rho_{n+1})$ selects the appropriate
partial ISI estimate $v_{n+2,j}(\rho_{n+1})$, which assumed $\hat{a}_{n,j}(\rho_{n+1})$ as value for $\tilde{a}_{n,j}$. Both $d_n(\rho_{n+1})$
and $\hat{a}_{n,j}(\rho_{n+1})$ become available at the end of the clock cycle corresponding to state
transitions $\{\rho_n\} \to \{\rho_{n+1}\}$. The output of the ISI-MUXU is 32 (8x4) partial ISI estimates
$v_{n+2,j}(\rho_{n+1})$, as there are eight states and four wire pairs.
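
The two-stage selection can be sketched in software as two nested lookups. This is only an
illustration of the selection order; the function name `isi_muxu`, the nested-dictionary layout
and the representation of the ACS decision as a predecessor state number are assumptions.

```python
PAM5 = (-2, -1, 0, 1, 2)

def isi_muxu(spec, acs_decision, survivor_symbol):
    """Two-stage selection for one wire pair and one successor state.

    spec[p][a]      : speculative estimate computed for predecessor state p
                      under the assumed symbol a (assumed layout)
    acs_decision    : predecessor state chosen by the ACSU
    survivor_symbol : 1D survivor symbol on this wire pair for the chosen branch
    """
    five_candidates = spec[acs_decision]     # stage 1: pick the survivor path
    return five_candidates[survivor_symbol]  # stage 2: pick the assumed symbol

# Example for successor state 0, whose contender paths come from states 0, 2, 4 and 6.
# The stored values are arbitrary placeholders that encode (state, symbol).
spec = {p: {a: 10 * p + a for a in PAM5} for p in (0, 2, 4, 6)}
selected = isi_muxu(spec, acs_decision=4, survivor_symbol=-1)
print(selected, sum(len(row) for row in spec.values()))
```

The second lookup is what distinguishes this structure from the case without parallel
transitions, where the ACS decision alone would identify the symbol.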

1D Branch Metric Unit

The 1D-BMU 1420 consists of two processing blocks. The 1D-LA-BMU 1424
takes the partial ISI estimates $\{v_{n+1,j}(\rho_n)\}$ computed in the DFU to calculate speculative 1D
branch metrics. In the 1D-BM-MUXU 1428, the appropriate 1D branch metrics are selected using
ACS decisions and corresponding survivor symbols.

Look-Ahead Computation of 1D Branch Metrics (1D-LA-BMU)

The 1D-LA-BMU 1424 precomputes speculative 1D branch metrics which are
then needed in the 4D-BMU 1430 one clock cycle later. Input into the 1D-LA-BMU 1424 are
the partial ISI estimates $\{v_{n+1,j}(\rho_n)\}$, which correspond to trellis transitions
$\{\rho_{n+1}\} \to \{\rho_{n+2}\}$. These ISI estimates consider the channel coefficients
$\{f_{2,j}, f_{3,j}, \ldots, f_{14,j}\}$ and the symbols from the survivor path into state $\rho_n$. A speculative
partial ISI estimate $\tilde{u}_{n+1,j}(\rho_n, \tilde{a}_{n,j})$, which also considers the ISI due to the channel
coefficient $f_{1,j}$ and assumes that $\tilde{a}_{n,j}$ is the 1D symbol corresponding to the
transition $\rho_n \to \rho_{n+1}$, is given by:

$$\tilde{u}_{n+1,j}(\rho_n, \tilde{a}_{n,j}) = v_{n+1,j}(\rho_n) - f_{1,j}\,\tilde{a}_{n,j}.$$

The speculative 1D branch metric for a transition from state $\rho_{n+1}$ under the symbol
$a_{n+1,j}$, assuming that the corresponding survivor path contains the survivor sequence into
state $\rho_n$ and is extended by the symbol $\tilde{a}_{n,j}$ to reach state $\rho_{n+1}$, is given by

$$\tilde{\lambda}_{n+1,j}(z_{n+1,j}, a_{n+1,j}, \rho_n, \tilde{a}_{n,j}) = \left(z_{n+1,j} - a_{n+1,j} + \tilde{u}_{n+1,j}(\rho_n, \tilde{a}_{n,j})\right)^2.$$

The precomputation of speculative 1D branch metrics for a particular initial state
$\rho_n$ and wire pair j is shown in FIG. 17, where the slicers calculate the difference between the
slicer input and the closest symbol in the 1D subsets A and B, respectively. As there are four
wire pairs, eight code states, five possibilities for $\tilde{a}_{n,j}$ (due to the PAM5 modulation), and two
possibilities for $a_{n+1,j}$ (A-type or B-type 1D symbol), in total 320 (8x4x5x2) different speculative
1D branch metrics are precomputed in the 1D-LA-BMU 1424.
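
The precomputation for one initial state and one wire pair can be sketched as below, assuming
squared-error metrics and slicers that return the difference to the closest subset level. The
names `slicer_error` and `la_bmu_1d` and the dictionary keyed by (assumed symbol, subset name)
are illustrative choices, not taken from FIG. 17.

```python
PAM5 = (-2, -1, 0, 1, 2)
SUBSETS = {"A": (-1, 1), "B": (-2, 0, 2)}

def slicer_error(x, levels):
    """Difference between the slicer input and the closest level."""
    return min((x - a for a in levels), key=abs)

def la_bmu_1d(z, v, f1):
    """Speculative 1D branch metrics for one code state and one wire pair.

    z  : received sample for time n+1 on this wire pair
    v  : partial ISI estimate covering taps 2..14
    f1 : first postcursor coefficient
    Returns {(assumed_symbol, subset_name): metric}, ten values in total.
    """
    metrics = {}
    for a_tilde in PAM5:
        u = v - f1 * a_tilde  # speculative ISI estimate including tap 1
        for name, levels in SUBSETS.items():
            metrics[(a_tilde, name)] = slicer_error(z + u, levels) ** 2
    return metrics

m = la_bmu_1d(z=1.0, v=0.0, f1=0.5)
print(len(m), 8 * 4 * len(m))  # 10 per state and pair, 320 in total
```

Ten metrics per code state and wire pair give the 320 values stated above when multiplied by
eight states and four pairs.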

Selection of 1D Branch Metrics (1D-BM-MUXU)

The appropriate 1D branch metric $\lambda_{n+1,j}(z_{n+1,j}, a_{n+1,j}, \rho_{n+1})$ which corresponds to a
transition from state $\rho_{n+1}$ under the 1D symbol $a_{n+1,j}$ is selected among 4x5=20 precomputed 1D
branch metrics $\tilde{\lambda}_{n+1,j}(z_{n+1,j}, a_{n+1,j}, \rho_n, \tilde{a}_{n,j})$ as there are four path extensions from different
states $\{\rho_n\}$ into $\rho_{n+1}$ and five possibilities for $\tilde{a}_{n,j}$ due to the PAM5 modulation. The selection
of a particular $\lambda_{n+1,j}(z_{n+1,j}, a_{n+1,j}, \rho_{n+1})$ in the 1D-BM-MUXU 1428 is performed using the same
multiplexer structure as shown in FIG. 16. First, the ACS decision $d_n(\rho_{n+1})$ determines the five
speculative 1D branch metrics, which correspond to the state $\rho_n$ being part of the survivor path
into $\rho_{n+1}$. Then, the survivor symbol $\hat{a}_{n,j}(\rho_{n+1})$ selects among these five metrics the one which
assumed $\hat{a}_{n,j}(\rho_{n+1})$ as value for $\tilde{a}_{n,j}$. The 1D-BM-MUXU 1428 selects in total 64 (8x4x2) actual
1D branch metrics, as there are eight states, four wire pairs and the two 1D subset types A and B.

Survivor Memory Unit 

The merge depth of the exemplary 1000BASE-T trellis code is 14. The SMU
must be implemented using the register-exchange architecture described in R. Cypher and C.B.
Shung, "Generalized Trace-Back Techniques for Survivor Memory Management in the Viterbi
Algorithm," J. VLSI Signal Processing, Vol. 5, 85-94 (1993), as the survivor symbols
corresponding to the time steps n-12, n-11, ..., n are needed in the DFU without delay and the
latency budget specified in the 1000BASE-T standard is very tight. The proposed
register-exchange architecture with merge depth 14 is shown in FIG. 18, where only the first row
storing the survivor sequence corresponding to state zero is shown. $SX_n(\rho_n)$ denotes the 4D symbol
decision corresponding to 4D subset SX and a transition from state $\rho_n$. The multiplexers in the
first column select the 4D survivor symbols $\{\hat{a}_n(\rho_{n+1})\}$, which are part of the survivor path into
$\{\rho_{n+1}\}$. These 4D survivor symbols are required in the ISI-MUXU 1416 and 1D-BM-MUXU
1428 to select the appropriate partial ISI estimates and 1D branch metrics, respectively. The
survivor symbols $\{\hat{a}_{n-1}(\rho_n), \hat{a}_{n-2}(\rho_n), \ldots, \hat{a}_{n-12}(\rho_n)\}$ which are stored in the registers
corresponding to the first, second, ..., 12th column are used in the LA-DFU 1412 to compute the
partial ISI estimates $v_{n+2,j}(\rho_n)$.
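
The register-exchange principle can be illustrated with a short software model in which each
state's row of registers is overwritten by the row of its selected predecessor, extended by the
newest survivor symbol. This is a generic sketch of register exchange, not the circuit of
FIG. 18; the function name and argument layout are assumptions.

```python
def register_exchange_step(rows, acs_decisions, new_symbols, depth=14):
    """One update of a register-exchange survivor memory.

    rows[s]          : survivor symbols for state s, most recent first
    acs_decisions[s] : predecessor state selected for state s
    new_symbols[s]   : survivor symbol of the selected branch into state s
    depth            : number of columns kept per row
    """
    return [
        ([new_symbols[s]] + rows[acs_decisions[s]])[:depth]
        for s in range(len(rows))
    ]

# Two-state toy example with depth 2 to show the copy-and-shift behaviour.
rows = [["a"], ["b"]]
rows = register_exchange_step(rows, [1, 0], ["x", "y"], depth=2)
print(rows)  # [['x', 'b'], ['y', 'a']]
rows = register_exchange_step(rows, [0, 0], ["p", "q"], depth=2)
print(rows)  # [['p', 'x'], ['q', 'x']]
```

All past survivor symbols of every state are readable in parallel after each step, which is
the property the DFU relies on, at the cost of copying whole rows every clock cycle.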

It is to be understood that the embodiments and variations shown and described 
herein are merely illustrative of the principles of this invention and that various modifications 
may be implemented by those skilled in the art without departing from the scope and spirit of the 
invention. 